Detecting Information-Dense Texts in Multiple News Domains
Authors
Abstract
We introduce the task of identifying information-dense texts, which report important factual information in a direct, succinct manner. We describe a procedure that allows us to automatically label a large training corpus of New York Times texts. We train a classifier based on lexical, discourse and unlexicalized syntactic features and test its performance on a set of manually annotated articles from the business, U.S. international relations, sports and science domains. Our results indicate that the task is feasible and that both syntactic and lexical features are highly predictive of the distinction. We observe considerable variation in prediction accuracy across domains and find that domain-specific models are more accurate.
Similar resources
Combining Lexical and Syntactic Features for Detecting Content-dense Texts in News
Content-dense news texts report important factual information about an event in a direct, succinct manner. Information-seeking applications such as information extraction, question answering and summarization normally assume that all the text they deal with is content-dense. Here we empirically test this assumption on news articles from the business, U.S. international relations, sports and science journalism d...
Reflection of Knowledge and Information Science’s News in the Press: A Case Study of Iran Newspaper
Background and Aim: The present study aims to explore the coverage and reflection of Knowledge and Information Science news in the Iranian press. Iran Newspaper, which is one of the main public newspapers in the country, was selected as the case for this study. Method: This study used content analysis as its research methodology and adopted an inductive approach in data analysis. All the pag...
Information Extraction from Texts: Adapting a System for Summarization of News Reports to the Domain of Bioinformatics
Natural language serves as an important information source in all areas of human activity. The huge amount of text on the Internet makes efficient information search a pressing problem; visually scanning all the textual information is difficult and time-consuming. There is a need for efficient, high-quality systems that extract the relevant information from texts. The paper presents...
The System of Engagement in a Sample of Prose Fiction and the News
Emerging within Systemic Linguistics, Appraisal/Evaluation is a framework for analyzing the language of evaluation, providing techniques for the systematic analysis of evaluation and stance as they operate in whole texts and in groupings of texts. There are three systems in the Appraisal framework: Attitude, Engagement, and Graduation. This study sets out to analyze the use of the system of Eng...
Cross-media Cross-genre Information Ranking Multi-media Information Networks
Current web technology has brought us a scenario in which information about a certain topic is widely dispersed across data from different domains and data modalities, such as texts and images from news and social media. Automatic extraction of the most informative and important multimedia summary (e.g. a ranked list of inter-connected texts and images) from massive amounts of cross-media and cross-gen...